White Wine Quality Analysis by Lauren Green

This analysis explores a dataset containing the chemical attributes and quality ratings of approximately 4,900 white variants of the Portuguese “Vinho Verde” wine. The objective is to determine which physicochemical properties affect wine quality.

The following describes the attributes of the wine, taken from Cortez et al., 2009 [1].

The following input variables are based on physicochemical tests:

1 - fixed acidity (tartaric acid - g/dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity (acetic acid - g/dm^3): the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste

3 - citric acid (g/dm^3): found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar (g/dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides (sodium chloride - g/dm^3): the amount of salt in the wine

6 - free sulfur dioxide (mg/dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide (mg/dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density (g/cm^3): the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates (potassium sulphate - g/dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial agent and antioxidant

11 - alcohol (% by volume): the percent alcohol content of the wine

The output variable is based on sensory data and a combination of the input variables:

12 - quality: score between 0 (very bad) and 10 (very excellent)

Univariate Plots Section

Let’s first look at the structure of the wine dataframe.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The wine dataset consists of 12 variables (X is a sequential count for each observation, so it was removed) with 4,898 observations. Now let’s have a look at the summary statistics (minimum, quartiles, median, mean and maximum) for the variables in the wine dataset.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
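The two outputs above are the kind of thing base R's `str()` and `summary()` print. As a minimal sketch, the snippet below rebuilds a tiny data frame from the first rows shown in the structure output (a subset of columns only), drops the `X` counter and prints both views. The commented `read.csv()` call and its file name are assumptions, not taken from the report.

```r
# First five rows of a few columns, copied from the structure output above.
# Loading the full file might look like this (file name is an assumption):
#   wine <- read.csv("wine.csv")
wine <- data.frame(
  X              = 1:5,
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

# X is only a sequential row counter, so remove it before analysis.
wine$X <- NULL

str(wine)      # one line per remaining column: type and first values
summary(wine)  # min, quartiles, median, mean and max per column
```

Note that `summary()` reports the mean alongside the five-number summary, which is why each block above has six rows.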

Summary Observations

  • The fixed acidity has a large range, varying from 3.8 g/dm^3 to 14.2 g/dm^3, with a median of 6.8 g/dm^3
  • The residual sugar varies from 0.6 g/dm^3 to 65.8 g/dm^3, with a median of 5.2 g/dm^3
  • The free sulfur dioxide and total sulfur dioxide both have very large ranges (2 to 289 mg/dm^3 and 9 to 440 mg/dm^3 respectively)
  • The alcohol content varies from 8 vol % to 14.2 vol %, with a median of 10.4 vol %

Quality is the output variable, thus we are interested in how the input variables affect the wine quality.

We can investigate the distribution of white wine quality by plotting a bar graph. Quality is scored from 0 to 10, where 0 is very bad and 10 is very excellent. From the summary above for quality, we can see that the minimum quality in this dataset is 3 while the maximum is 9.

The quality scores are roughly symmetric around their centre: the mean (5.88) and median (6) are close to one another, which is consistent with an approximately normal distribution.

Let’s plot a bar graph for the quality of wine and categorize the quality of white wine as follows:

  • Bad: 0 - 4
  • Average: 5 - 6
  • Excellent: 7 - 10

A column named ‘level’ was added to the wine dataframe containing the wine quality levels: bad, average and excellent.

## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ level               : chr  "average" "average" "average" "average" ...

The majority of the wines are of average quality; there are few bad quality wines in this dataset.
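One way to build the ‘level’ column and the stacked bar graph is sketched below in base R. The quality scores are made up for illustration, and the nested `ifelse()` is an assumption based on the report's reference list rather than the author's exact code.

```r
# Illustrative quality scores only, not the real data.
wine <- data.frame(quality = c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 9))

# Nested ifelse(): 0-4 -> bad, 5-6 -> average, 7-10 -> excellent.
wine$level <- ifelse(wine$quality <= 4, "bad",
              ifelse(wine$quality <= 6, "average", "excellent"))

# Rows = level, columns = quality score; each score belongs to one level.
counts <- table(wine$level, wine$quality)
print(counts)

# Share of wines per level, to see how dominant the 'average' class is.
level_share <- prop.table(table(wine$level))
print(round(level_share, 2))

# One bar per quality score, shaded by level.
barplot(counts, xlab = "quality", ylab = "count", legend.text = TRUE)
```

Printing the shares next to the plot makes the class imbalance explicit, which matters later when a model is fitted to these data.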

Let’s now look at histograms for the input variables.

Input Variable Observations

  • Fixed acidity, volatile acidity, citric acid, chlorides, free sulfur dioxide, total sulfur dioxide, density and pH all follow a close to normal distribution. The mean (red line) and median (summary data) for these variables are fairly close, which is consistent with a roughly symmetric, near-normal distribution.
  • Residual sugar appears bi-modal
  • Sulphates appears somewhat close to a normal distribution, with the mean being slightly greater than the median indicating that the graph is slightly right-skewed.
  • The distribution for the alcohol volume content is unclear
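The mean-versus-median comparison used in these observations can be tabulated directly. The sketch below copies the mean and median of a few variables from the summary output earlier in the report and expresses the gap relative to the median; the choice of variables and the `rel_gap` name are illustrative.

```r
# Mean and median copied from the summary output earlier in this report.
stats <- data.frame(
  variable = c("fixed.acidity", "pH", "sulphates", "alcohol", "residual.sugar"),
  mean     = c(6.855, 3.188, 0.4898, 10.51, 6.391),
  median   = c(6.8,   3.18,  0.47,   10.4,  5.2)
)

# Gap between mean and median, relative to the median.
# A positive gap means the mean sits above the median (a hint of right skew).
stats$rel_gap <- round((stats$mean - stats$median) / stats$median, 3)

print(stats[order(stats$rel_gap), ])
```

A small gap is only a necessary sign of symmetry: a symmetric bi-modal variable would also have a mean close to its median, so the histograms are still needed.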

Let’s do some transformations on the residual sugar.

Now let’s transform the alcohol data.

When we transform the residual sugar using a log scale for the x-axis, the graph clearly shows a bi-modal distribution. After transforming the x-axis for the alcohol data, the graph is now somewhat closer to a rectangular distribution or may even display a tri-modal distribution.
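To illustrate why a log scale can expose two modes, the sketch below compares a histogram on the linear scale with one on a log10 scale. The data are a deterministic synthetic mixture, not the wine data, and the sketch transforms the values themselves rather than the plot axis, which may differ from how the report's figures were drawn.

```r
# Synthetic mixture (not the real data): one cluster of dry wines near
# 1.5 g/dm^3 and one cluster of sweeter wines near 10 g/dm^3.
sugar <- c(qlnorm(ppoints(300), meanlog = log(1.5), sdlog = 0.3),
           qlnorm(ppoints(300), meanlog = log(10),  sdlog = 0.4))

op <- par(mfrow = c(1, 2))  # two panels side by side
hist(sugar, breaks = 40, main = "Linear scale", xlab = "residual sugar")
h_log <- hist(log10(sugar), breaks = 40, main = "log10 scale",
              xlab = "log10(residual sugar)")
par(op)                     # restore the previous plotting layout
```

On the linear scale the upper cluster is stretched into a long tail; on the log scale both clusters have similar width, so the two peaks are easier to see.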

Univariate Analysis

What is the structure of your dataset?

The wine dataset originally consisted of 12 variables (X is a sequential count for each observation, hence it was removed) with 4,898 observations. 11 of these variables are input variables that are physicochemical properties of the wine. There is one output variable: quality.

A new column was added to the wine dataset called ‘level’ which was a categorical measure of wine quality. Wines considered ‘bad’ quality had quality ratings between 0 and 4, ‘average’ wine quality had quality ratings between 5 and 6 and wines with a quality rating between 7 and 10 were considered to be ‘excellent’ quality.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality (the output variable); quality is based on sensory properties (such as taste, smell and sight) and these properties are affected by the physicochemical properties of the wine (input variables).

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Most input variables had a normal distribution. It is hard to say which variables will have an impact on the wine quality at this stage in the investigation.

The input variables that were presumed to affect the wine quality are:

  • citric acid, as this affects the taste/flavor of wine
  • residual sugar, as this affects wine sweetness
  • pH, as it is crucial to the taste of wine
  • alcohol, as this may influence wine quality
  • chlorides, as this is the amount of salt in the wine

Did you create any new variables from existing variables in the dataset?

The ‘level’ variable, based on the wine quality variable, was created.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Both the residual sugar and alcohol data displayed unusual behaviour. These two variables were transformed using a log scale for the x-axis to better understand the distribution. After the transformation, the residual sugar displayed a bi-modal distribution. After transforming the x-axis for the alcohol data, the distribution appeared to be closer to a normal distribution or perhaps even a bi-modal distribution.

A bar graph was also plotted using the quality levels and it was found that most wines were of ‘average’ quality with very few ‘bad’ and ‘excellent’ quality wines. It may be challenging to build a predictive model with this data, as our sample mainly consists of ‘average’ quality wine with very few ‘bad’ and ‘excellent’ wines.

Bivariate Plots Section

Let’s first analyse the correlation coefficients between the variables using a scatterplot matrix:

We can see the correlation coefficients between the variables from the scatterplot matrix, but it would be better to view them with the most significant correlations emphasized using a colour scheme. This was done below.

It is not immediately clear from the scatterplot matrix which correlations are the strongest. The correlation matrix plot, however, uses colour to show at a glance which correlation coefficients are highest and lowest.

Here we see that a few variables are negatively correlated with wine quality:

  • density (-0.31)
  • chlorides (-0.21)
  • volatile acidity (-0.19)
  • total sulfur dioxide (-0.18)
  • fixed acidity (-0.11)
  • residual sugar (-0.1)

The following variables are positively correlated with wine quality:

  • alcohol (0.44)
  • pH (0.1)
  • sulphates (0.05)

The correlations between input variables that also stand out from the correlation matrix are:

  • residual sugar and density (0.84)
  • alcohol and density (-0.78)
  • free sulfur dioxide and total sulfur dioxide (0.62)
  • density and total sulfur dioxide (0.53)
  • alcohol and residual sugar (-0.45)
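Coefficient lists like these can be read off a Pearson correlation matrix. The sketch below computes one with base R's `cor()` on synthetic data with built-in relationships and ranks the correlations with quality by absolute value. The data, coefficients and seed are illustrative assumptions, and the coloured plot itself is not reproduced here.

```r
set.seed(42)
n <- 500

# Synthetic data with built-in relationships, for illustration only.
alcohol <- runif(n, 8, 14)
sugar   <- rexp(n, rate = 1 / 6)
density <- 1.0 - 0.002 * alcohol + 0.0004 * sugar + rnorm(n, sd = 0.0005)
quality <- round(3 + 0.3 * alcohol + rnorm(n, sd = 0.8))
wine    <- data.frame(alcohol, residual.sugar = sugar, density, quality)

# Pearson correlation matrix (the default method of cor()).
corr <- round(cor(wine), 2)
print(corr)

# Correlations with quality, strongest first by absolute value.
with_quality <- corr[rownames(corr) != "quality", "quality"]
print(with_quality[order(abs(with_quality), decreasing = TRUE)])
```

Ranking by absolute value puts strong negative and strong positive relationships side by side, which is the same comparison the colour scale supports visually.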

Let’s investigate how alcohol, density, chlorides and total sulfur dioxide affect wine quality, using boxplots to investigate these trends.

For bad quality wine, the alcohol content is just over 10 % by volume, then for average quality of 5, the alcohol content decreases and then increases from quality 6 to excellent quality (7 - 9).

The median density is fairly steady for bad quality wines (just under 0.995), then increases slightly for wine quality level of 5 and then gradually decreases as the quality level increases from average to excellent.

Chlorides seem to decrease steadily as quality increases.

Here we see that the median for the free sulfur dioxide is lower for bad quality wines (~ 20 mg/dm^3) and fairly consistent for average and excellent quality wines (30 - 35 mg/dm^3). The total sulfur dioxide is lower for bad quality wine, slightly higher for average quality and then decreases again for excellent quality wine. The free sulfur dioxide is responsible for preserving the wine, thus perhaps a wine with a lower free sulfur dioxide content may result in a wine that is not as fresh. Excessive amounts of total sulfur dioxide can inhibit fermentation as well as cause undesirable sensory flavour.
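The median-by-quality comparisons described above can be checked numerically as well as visually. This is a minimal base R sketch with made-up values chosen to mimic the dip at quality 5; it is not the real data and not necessarily how the report's boxplots were drawn.

```r
# Illustrative values only, not the real data.
wine <- data.frame(
  quality = c(4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(10.1, 10.4, 9.8, 9.4, 9.6, 9.5,
              10.2, 10.6, 10.4, 11.3, 11.6, 11.4)
)

# One box per quality score.
boxplot(alcohol ~ quality, data = wine,
        xlab = "quality", ylab = "alcohol (% by volume)")

# Median alcohol content for each quality score.
medians <- tapply(wine$alcohol, wine$quality, median)
print(medians)
```

Tabulating the medians alongside the plot makes it easier to state trends such as "decreases at 5, then increases" with actual numbers.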

Let’s also have a look at the input variables that were strongly correlated using scatterplots. First we will look at the relationship between residual sugar and density.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

The residual sugar is very strongly positively correlated with density, with a correlation coefficient of 0.84. A red linear trend-line is plotted for the data.

Let’s now look at a scatterplot for alcohol and density.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

Alcohol is strongly negatively correlated with density, with a correlation coefficient of -0.78.
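The t statistics in the two test outputs above follow from the correlation coefficient and the degrees of freedom via t = r * sqrt(df) / sqrt(1 - r^2). The sketch below applies that formula to the reported coefficients and then confirms it against `cor.test()` on a small made-up sample; the helper name `t_from_r` and the toy vectors are illustrative.

```r
# t statistic of a Pearson correlation with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

df <- 4898 - 2  # 4,898 observations, as in the outputs above

# Coefficients copied from the two test outputs above.
t_sugar   <- t_from_r(0.8389665, df)   # density vs residual sugar
t_alcohol <- t_from_r(-0.7801376, df)  # density vs alcohol
print(round(c(sugar = t_sugar, alcohol = t_alcohol), 2))

# The same identity holds for any cor.test() result (toy data).
x   <- c(1, 2, 3, 4, 5, 6)
y   <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
res <- cor.test(x, y)
print(c(reported = unname(res$statistic),
        formula  = t_from_r(unname(res$estimate), unname(res$parameter))))
```

With almost 4,900 observations even a weak correlation produces a large t and a tiny p-value, so the size of r is more informative here than its significance.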

Next let’s investigate the relationship between free sulfur dioxide and total sulfur dioxide.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

There is a strong, positive correlation between the free sulfur dioxide and total sulfur dioxide, with a correlation coefficient of 0.62. This is because free sulfur dioxide constitutes a portion of total sulfur dioxide, with the remainder being bound sulfur dioxide.

total SO2 = free SO2 + bound SO2 [5]

The free SO2 portion (not associated with wine molecules) essentially acts as a buffer against microbes and oxidation. In contrast, the bound SO2 portion (sulfites bound to molecules such as sugars, acetaldehyde or phenolic compounds) has already done its work and is no longer useful as a preservative [5]. Sulfur dioxide levels need to be carefully regulated: excess sulfur dioxide not only results in an unpleasant taste, but sulfites are also allergens and can be harmful to people in excess.
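Because total SO2 is the sum of the free and bound portions, the bound portion can be derived from the two measured columns. A small sketch using the first four rows shown in the structure output earlier in the report; the `bound` and `free_share` names are illustrative.

```r
# First four rows of the dataset, in mg/dm^3 (from the structure output).
free  <- c(45, 14, 30, 47)
total <- c(170, 132, 97, 186)

# total SO2 = free SO2 + bound SO2, so bound SO2 is the difference.
bound <- total - free
print(bound)

# Fraction of the total that is still free (available as a preservative).
free_share <- round(free / total, 2)
print(free_share)
```

Since free SO2 is contained in total SO2, a positive correlation between the two columns is expected by construction.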

Next we investigated the relationship of density and total sulfur dioxide.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$total.sulfur.dioxide
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor 
## 0.5298813

There is a moderate positive correlation between density and total sulfur dioxide, with a correlation coefficient of 0.53.

Finally the relationship between alcohol and residual sugar was investigated.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

There is a moderate negative correlation between alcohol and residual sugar, with a correlation coefficient of -0.45.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The correlation matrix plot showed that the following variables are positively correlated with wine quality:

  • alcohol (0.44)
  • pH (0.1)
  • sulphates (0.05)

Here we saw that alcohol had the strongest relationship with wine quality, with higher quality wines having a higher alcohol content.

The correlation matrix showed that the following variables are negatively correlated with wine quality:

  • density (-0.31)
  • chlorides (-0.21)
  • volatile acidity (-0.19)
  • total sulfur dioxide (-0.18)
  • fixed acidity (-0.11)
  • residual sugar (-0.1)

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The correlations that also stand out from the correlation matrix as well as the scatterplot investigation are the following:

  • residual sugar and density (0.84)
  • alcohol and density (-0.78)
  • free sulfur dioxide and total sulfur dioxide (0.62)
  • density and total sulfur dioxide (0.53)
  • alcohol and residual sugar (-0.45)

What was the strongest relationship you found?

The strongest relationship was that between residual sugar and density: the higher the residual sugar (the remaining sugar after fermentation), the higher the density. This makes sense, because when sugar dissolves in water it becomes part of the solution and adds mass with comparatively little change in volume, so the density rises as more sugar is added.

Alcohol had the greatest effect on wine quality, with better quality wines having a higher alcohol content.

Multivariate Plots Section

Both residual sugar and alcohol are strongly correlated with density. Thus these three variables will be investigated.

Holding residual sugar constant, we can see that as the alcohol content increases, the density decreases. This may be because wine is more dense than ethanol. The density of ethanol is 0.789 g/cm^3 at 20 degrees Celsius [8]. The density of white wine is around 0.98 g/cm^3 [9]; thus the higher the alcohol volume percentage, the lower the volume percentage of the rest of the wine, and the density of the wine solution decreases because ethanol is less dense than white wine.
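The direction of this effect can be illustrated with a simple volume-weighted mixing calculation. This is a rough sketch under two stated assumptions that are not from the report: the non-alcohol portion is treated as having a density of 1.000 g/cm^3 (roughly water), and volumes are assumed to add ideally, which is only approximately true for ethanol and water.

```r
rho_ethanol <- 0.789  # g/cm^3 at 20 degrees Celsius, as cited in the text [8]
rho_rest    <- 1.000  # ASSUMPTION: non-alcohol portion is roughly water

# Alcohol content in % by volume, spanning the range seen in the dataset.
alcohol_pct <- c(8, 10, 12, 14)
frac        <- alcohol_pct / 100

# Volume-weighted average density (ASSUMPTION: ideal, additive volumes).
rho_mix <- frac * rho_ethanol + (1 - frac) * rho_rest
print(data.frame(alcohol_pct, rho_mix))
```

The estimates fall as the alcohol fraction rises, matching the trend in the plot. They sit below the densities recorded in the dataset because real wine also contains dissolved sugar and acids, which raise the density.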

Next we are going to analyze how residual sugar and chlorides affect density across quality. A new variable ‘density_bucket’ was created to do this analysis.

Chlorides are measured as sodium chloride (NaCl), in other words the amount of salt in the wine. Here we can see that, for the majority of the data at each quality level, the higher the chlorides in the wine, the higher the density. When sodium chloride dissolves, it adds mass to the solution with comparatively little change in volume, so the density increases. We also see that as the quality of the wine increases, the amount of chlorides decreases gradually.

The residual sugar is the amount of sugar left over after fermentation, as the fermentation process consumes sugars to create ethanol, with carbon dioxide as a by-product. From the graph above, the higher the residual sugar content in the wine, the higher the density of the wine. The same argument applies as for the sodium chloride above, although the trend between residual sugar and density is much clearer than that for the chlorides.
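A bucketed variable such as ‘density_bucket’ can be created with base R's `cut()`. The report does not state which break points were used, so the sketch below assumes the quartiles from the density summary as boundaries; the sample densities are a handful of values within the observed range.

```r
# A few density values in g/cm^3 (illustrative sample).
density <- c(1.001, 0.994, 0.995, 0.996, 0.990, 0.9925)

# ASSUMPTION: buckets bounded by min, quartiles and max from the summary.
breaks <- c(0.9871, 0.9917, 0.9937, 0.9961, 1.0390)

# include.lowest = TRUE keeps the minimum value inside the first bucket.
density_bucket <- cut(density, breaks = breaks, include.lowest = TRUE)

print(data.frame(density, density_bucket))
print(table(density_bucket))
```

Quartile-based breaks give buckets with similar numbers of wines; equal-width breaks would instead leave the highest bucket almost empty, because few wines have a density near the maximum.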

Alcohol content and density were highly correlated with each other (Pearson’s r of -0.78), and these variables were also the most correlated with quality. Thus the alcohol content and density will be investigated together with the ‘level’ variable created earlier, which holds the wine quality level of ‘bad’, ‘average’ or ‘excellent’.

The slope for the excellent quality wine is the steepest. This means that, for the same change in density, the change in alcohol content is greater for excellent quality wine than for average and bad quality wine.
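Comparing slopes across quality levels amounts to fitting one straight line per level. The sketch below does this with `lm()` on noise-free toy data whose slopes are chosen by hand, so the fitted values are known in advance. The data, the slope values and the choice of alcohol as the response are all assumptions for illustration.

```r
# Noise-free toy data, for illustration only.
toy <- data.frame(
  level   = rep(c("bad", "average", "excellent"), each = 4),
  density = rep(c(0.990, 0.992, 0.994, 0.996), times = 3)
)

# ASSUMPTION: hand-picked slopes (alcohol % per unit of density).
assumed_slope <- c(bad = -200, average = -300, excellent = -400)
toy$alcohol <- 12 +
  unname(assumed_slope[as.character(toy$level)]) * (toy$density - 0.990)

# Fit alcohol ~ density separately within each level and keep the slope.
slopes <- sapply(split(toy, toy$level), function(d) {
  coef(lm(alcohol ~ density, data = d))[["density"]]
})
print(round(slopes, 1))
```

The most negative slope marks the level where alcohol changes fastest per unit of density. With real data the slopes also carry sampling error, so small differences between levels should be read cautiously.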

Let’s add contour lines onto the figure above and remove the scatter points.

Let’s investigate the plot above by using a 2D density plot. In the contour plot we can’t see the average quality layer very well; this should be more visible in the following plot.

These two plots seem to show the same relationship.

Lastly, a linear model for quality was created.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wine)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + chlorides, data = wine)
## m4: lm(formula = I(quality) ~ I(alcohol) + density + chlorides + 
##     volatile.acidity, data = wine)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + chlorides + 
##     volatile.acidity + total.sulfur.dioxide, data = wine)
## 
## ==============================================================================================
##                              m1            m2            m3            m4            m5       
## ----------------------------------------------------------------------------------------------
##   (Intercept)               2.582***    -22.492***    -21.150***    -35.573***    -30.759***  
##                            (0.098)       (6.165)       (6.162)       (6.010)       (6.295)    
##   I(alcohol)                0.313***      0.360***      0.343***      0.389***      0.391***  
##                            (0.009)       (0.015)       (0.015)       (0.015)       (0.015)    
##   density                                24.728***     23.671***     38.217***     33.251***  
##                                          (6.079)       (6.074)       (5.926)       (6.234)    
##   chlorides                                            -2.382***     -1.300*       -1.370*    
##                                                        (0.558)       (0.542)       (0.543)    
##   volatile.acidity                                                   -2.043***     -2.070***  
##                                                                      (0.111)       (0.111)    
##   total.sulfur.dioxide                                                              0.001*    
##                                                                                    (0.000)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.195         0.248         0.249     
##   adj. R-squared            0.190         0.192         0.195         0.247         0.248     
##   sigma                     0.797         0.796         0.795         0.768         0.768     
##   F                      1146.395       583.290       396.315       402.956       324.034     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5822.011     -5657.292     -5654.027     
##   Deviance               3112.257      3101.773      3090.247      2889.234      2885.385     
##   AIC                   11684.782     11670.255     11654.021     11326.584     11322.054     
##   BIC                   11704.272     11696.241     11686.504     11365.563     11367.530     
##   N                      4898          4898          4898          4898          4898         
## ==============================================================================================

This is not a good model for predicting quality: the R-squared values range from 0.190 to 0.249, so even the largest model (m5) explains only about 25% of the variance in quality.
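The adjusted R-squared row of the table can be reproduced from the R-squared row with 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors. The sketch below checks this for m4 and m5 using the reported values, then fits two nested models on synthetic data to show how such a comparison is built. The synthetic data, seed and helper name `adj_r2` are illustrative.

```r
# Adjusted R-squared from R-squared, sample size n and p predictors.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Reported R-squared values for m4 (4 predictors) and m5 (5 predictors).
reported <- round(c(m4 = adj_r2(0.248, 4898, 4),
                    m5 = adj_r2(0.249, 4898, 5)), 3)
print(reported)

# Nested models on synthetic data (illustration only, not the wine data).
set.seed(7)
n <- 200
alcohol          <- runif(n, 8, 14)
volatile.acidity <- runif(n, 0.1, 0.6)
quality <- 2.5 + 0.3 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.8)

m1 <- lm(quality ~ alcohol)
m2 <- lm(quality ~ alcohol + volatile.acidity)

# Collect fit statistics for each model, one column per model.
fit <- sapply(list(m1 = m1, m2 = m2), function(m) {
  c(r2 = summary(m)$r.squared, adj = summary(m)$adj.r.squared, AIC = AIC(m))
})
print(round(fit, 3))
```

R-squared never decreases when a predictor is added to a nested least-squares model, which is why adjusted R-squared, AIC and BIC are reported alongside it. The `I()` wrappers in the formulas above have no effect on a bare variable name, so `quality ~ alcohol` fits the same model.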

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Both residual sugar and alcohol are strongly correlated with density. Thus these three variables were investigated. Holding residual sugar constant, we saw that as the alcohol content increases, the density decreases. This may be because wine is more dense than ethanol.

Next we looked at how residual sugar and chlorides affect density. A new variable ‘density_bucket’ was created to do this analysis. It was found that the higher the content of either chlorides or residual sugar, the higher the wine density. This is because dissolved solids add mass to the solution with comparatively little change in volume, which makes it more dense.

Alcohol content and density were highly correlated with each other (Pearson’s r of -0.78), and these variables were also the most correlated with quality. Thus the alcohol content and density were investigated together with the ‘level’ variable created earlier, which holds the wine quality level of ‘bad’, ‘average’ or ‘excellent’. It was found that higher quality wines seem to have a higher alcohol content and a lower density.

Were there any interesting or surprising interactions between features?

It was interesting that alcohol and density seemed to follow opposite trends: when the alcohol content decreased, the density increased, and vice versa. This may be because wine is more dense than ethanol.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

A linear model was constructed for quality; however, this model was not a good fit, with R-squared values from 0.190 to 0.249. This may be because a linear model is not appropriate for these data, or because the relationships between these variables and quality are not consistent across quality levels, so that a general model to predict wine quality is not possible. It may also be that there are variables that affect wine quality that were not present in the dataset and are perhaps not easily quantifiable, such as smell.


Final Plots and Summary

Plot One

Description One

A bar graph was plotted for the quality of wine, which was categorized as follows:

  • Bad: 0 - 4
  • Average: 5 - 6
  • Excellent: 7 - 10

This stacked bar graph visually shows that the majority of the white wine is of average quality; there are very few bad quality wines in this dataset. The mode is 6 as it is the most frequently occurring quality value (olive green portion is the largest). The median is also 6, and the mean is very close at 5.88, which suggests that the quality variable is approximately normally distributed.

Plot Two

Description Two

The correlation coefficients between the variables were evaluated and the following was found:

Here we saw that a few variables are negatively correlated with wine quality:

  • density (r = -0.31)
  • chlorides (r = -0.21)
  • volatile acidity (r = -0.19)
  • total sulfur dioxide (r = -0.18)
  • fixed acidity (r = -0.11)
  • residual sugar (r = -0.1)

The following variables are positively correlated with wine quality:

  • alcohol (r = 0.44)
  • pH (r = 0.1)
  • sulphates (r = 0.05)

The correlations between input variables that also stand out from the correlation matrix are:

  • residual sugar and density (r = 0.84)
  • alcohol and density (r = -0.78)
  • free sulfur dioxide and total sulfur dioxide (r = 0.62)
  • density and total sulfur dioxide (r = 0.53)
  • alcohol and residual sugar (r = -0.45)

Plot Three

Description Three

For the boxplots for alcohol vs quality: For bad quality wine, the alcohol content is just over 10 % by volume, then for average quality of 5, the alcohol content decreases and then increases from quality 6 to excellent quality (7 - 9).

For the boxplots for density vs quality: The median density is fairly steady for bad quality wines (just under 0.995), then increases slightly for wine quality level of 5 and then gradually decreases as the quality level increases from average to excellent.

It’s interesting that the alcohol and density versus quality seem to follow opposite trends, when the alcohol content decreases, we see the density increase. The data points are also overlaid in both graphs in black as well as jittered for visibility. Here we see that there is more data available for the average quality wines.

Plot Four

Description Four

The residual sugar is very strongly positively correlated with density, with a correlation coefficient of 0.84. A red linear trendline is plotted for the data.

Alcohol is strongly negatively correlated with density, with a correlation coefficient of -0.78.

These variables were the most strongly correlated in the white wine dataset.

Plot Five

Description Five

Both residual sugar and alcohol are strongly correlated with density. Thus these three variables were investigated.

Holding residual sugar constant, we can see that as the alcohol content increases, the density decreases. This may be because wine is more dense than ethanol. The density of ethanol is 0.789 g/cm^3 at 20 degrees Celsius [8]. The density of white wine is around 0.98 g/cm^3 [9]; thus the higher the alcohol volume percentage, the lower the volume percentage of the rest of the wine, and the density of the wine solution decreases because ethanol is less dense than white wine.

Plot Six

Description Six

Alcohol content and density were highly correlated with each other (Pearson’s r of -0.78), and these variables were also the most correlated with quality. Thus the alcohol content and density were investigated together with the ‘level’ variable created earlier, which holds the wine quality level of ‘bad’, ‘average’ or ‘excellent’.

The first plot is a scatter plot with an overlay of contour lines by quality level. The second plot contains the same variables but is a 2D density plot. The two plots above seem to show the same relationship. For average quality wine we cannot see the contour clearly, but in the 2D density plot we see a peak at alcohol ~ 9.4 vol% and density ~ 0.45 g/cm^3. What is interesting are the three peaks for excellent quality wine: we see the three contours in the first plot, which is corroborated by the peaks in the second plot. High quality wines seem to have a higher alcohol content and a lower density, although in general the peaks are not as high for excellent quality wine as for average and bad quality wine. This further supports the discussion from Plot 5.


Reflection

From the White Wine Quality Analysis the following was concluded:

Alcohol content seemed to have the greatest effect on wine quality (the higher the alcohol content, the greater the wine quality). Residual sugar, which I initially expected to matter most, had very little effect on wine quality. Average and excellent quality wines seemed to have a slightly higher free and total sulfur dioxide content, which adds to the freshness of the wine; in excess, however, SO2 can produce a bad odour or taste and can act as an allergen.

From this investigation, it seems that predicting the wine quality based on these chemical properties proved to be challenging (the linear model was not a good predictor for wine quality). There may be other variables that were not quantified in this dataset such as grape type, climate, temperature, sunlight, soil, levels of tannins in the wine, aging process and so on that may have a greater effect on wine quality.

In future, it may be better to explore these trends by quality level. This dataset consisted mainly of average quality wine, so it would also be beneficial to have a larger dataset with more bad and excellent quality wines.

References

  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236

  2. https://stackoverflow.com/questions/6557977/how-do-i-add-the-mean-value-to-a-histogram-in-r/6558014#6558014 (add mean line to graph)

  3. http://www.talkstats.com/threads/adding-a-new-column-in-r-data-frame-with-values-conditional-on-another-column.30924/ (ifelse statement, creating level column)

  4. https://stackoverflow.com/questions/24895575/ggplot2-bar-plot-with-two-categorical-variables (stacked bar chart)

  5. https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/

  6. http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2 (correlation matrix)

  7. https://www.r-graph-gallery.com/264-control-ggplot2-boxplot-colors/ (fill histogram levels)

  8. https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Ethanol_(data_page).html (ethanol density)

  9. http://web2.slc.qc.ca/jmc/w05/Wine/results.html

  10. https://stats.stackexchange.com/questions/31726/scatterplot-with-contour-heat-overlay (scatterplot with contour overlay)

  11. https://stackoverflow.com/questions/23675735/how-to-add-boxplots-to-scatterplot-with-jitter (add scatterplot to boxplot)

  12. https://stackoverflow.com/questions/12980081/create-a-stacked-density-graph-in-ggplot2 (stacked density graph)

  13. http://petewerner.blogspot.co.za/2012/12/density-plot-with-ggplot.html (stacked density graph)